This case is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.
The dataset contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
The classification goal is to predict the likelihood of a liability customer buying personal loans.
Exploratory Data Analysis
Preparing the data to train a model
Training and making predictions using a classification model
Model evaluation
Banking
ID: Customer ID
Age: Customer's age in completed years
Experience: #years of professional experience
Income: Annual income of the customer ($000)
ZIP Code: Home Address ZIP code.
Family: Family size of the customer
CCAvg: Avg. spending on credit cards per month ($000)
Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage: Value of house mortgage if any. ($000)
Personal Loan: Did this customer accept the personal loan offered in the last campaign?
Securities Account: Does the customer have a securities account with the bank?
CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
Online: Does the customer use internet banking facilities?
Credit card: Does the customer use a credit card issued by the bank?
1. Import the datasets and libraries; check data types, statistical summary, shape, and null values or incorrect imputation. (5 marks)
2. Study the data distribution in each attribute and the target variable, and share your findings.
3. Split the data into training and test sets in the ratio 70:30. (5 marks)
4. Use a Logistic Regression model to predict whether the customer will take a personal loan or not. Print all the metrics relevant for evaluating the model performance. (15 marks)
5. Check different parameters of Logistic Regression and reason about whether the model performance is affected by them or not. (10 marks)
6. Give a business understanding of your model. (5 marks)
import warnings
warnings.filterwarnings('ignore')
import os,sys
import pandas as pd
import numpy as np
from scipy import stats
# importing ploting libraries
import matplotlib.pyplot as plt
#importing seaborn for statistical plots
import seaborn as sns
# Work around a seaborn/statsmodels incompatibility by forcing seaborn's built-in KDE
sns.distributions._has_statsmodels=False
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# calculate accuracy measures and confusion matrix
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, classification_report, confusion_matrix,auc
from sklearn.metrics import recall_score,precision_score, f1_score,accuracy_score
from sklearn import preprocessing
# Set color code,font scale..
sns.set(color_codes=True,rc={'figure.figsize':(35.0,35.0)},font_scale=1)
#Bank personal Loan Data
Bank_Personal_DF = pd.read_csv('Bank_Personal_Loan_Modelling.csv')
Bank_Personal_DF.shape
# Check for null values and data types
print(Bank_Personal_DF.info())
# Shape
print(Bank_Personal_DF.shape)
# Describe: summary statistics
print(Bank_Personal_DF.describe())
print(Bank_Personal_DF.isnull().any())
print(Bank_Personal_DF.isnull().sum().sum())
display(Bank_Personal_DF.isnull().sum())
Bank_Personal_DF.head()
# Let's analyse the distribution of the various attributes
Bank_Personal_DF.describe().T
Experience contains some negative values, and experience should not be negative.
The maximum of Income, Experience, CCAvg, Mortgage, Securities Account, CD Account, and CreditCard is much higher than the mean, which suggests these columns contain some extreme values.
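Both observations can be verified programmatically. A minimal sketch with a small hypothetical frame (the column names mirror the dataset, but the values here are made up):

```python
import pandas as pd

# Hypothetical stand-in for Bank_Personal_DF (values are illustrative)
df = pd.DataFrame({"Experience": [5, -1, 10, -3, 20],
                   "Income": [40, 50, 60, 400, 45]})

# Count impossible negative Experience values
n_negative = (df["Experience"] < 0).sum()
print(n_negative)  # 2

# Flag extreme values: anything beyond mean + 3 * std
cutoff = df["Income"].mean() + 3 * df["Income"].std()
print((df["Income"] > cutoff).sum())
```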
#ID can be dropped as it is unique for all 5000 rows
Bank_Personal_DF.drop('ID', axis=1, inplace=True)
## Experience should not be negative-Replacing with Median
Bank_Personal_DF['Experience'] = Bank_Personal_DF['Experience'].map(lambda x : Bank_Personal_DF['Experience'].median() if x < 0 else x)
Bank_Personal_DF['Experience'].describe()
# remove the rows of data which have missing value(s)
#There are no null values, so the dropna below has no effect
Bank_Personal_DF = Bank_Personal_DF.dropna()
#Creating Profile Report for Analysis
#!pip install pandas_profiling
import pandas_profiling
Bank_Personal_DF.profile_report()
#Unique Values
display(Bank_Personal_DF.nunique())
By checking the unique counts above, we can separate the columns into continuous and categorical variables.
category_variables=[col for col in Bank_Personal_DF.columns if Bank_Personal_DF[col].nunique()<=5]
print("Category Variables")
print(category_variables)
print()
print("Continuous Variables")
continue_variables=[col for col in Bank_Personal_DF.columns if Bank_Personal_DF[col].nunique()>5]
print(continue_variables)
#Personal Loan is the target variable we are analyzing, so remove it from the category list
category_variables.remove("Personal Loan")
print(category_variables)
Bank_Personal_DF_ZeroMortage = Bank_Personal_DF[Bank_Personal_DF['Mortgage']==0]
print(Bank_Personal_DF_ZeroMortage.shape[0])
display(Bank_Personal_DF_ZeroMortage)
Bank_Personal_DF_CreditCard = Bank_Personal_DF[Bank_Personal_DF['CCAvg']==0]
print(Bank_Personal_DF_CreditCard.shape[0])
display(Bank_Personal_DF_CreditCard)
## category columns
#ZIP Code is numeric but not continuous; it behaves like a category,
#because a ZIP code identifies a region and regions are categorical.
category_variables = ['ZIP Code', 'Family', 'Education', 'Personal Loan', 'Securities Account', 'CD Account', 'Online', 'CreditCard']
for i in category_variables:
    display(i)
    display(Bank_Personal_DF[i].value_counts(normalize=True))
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(continue_variables):
    ax=fig.add_subplot(3,3,i+1)
    sns.distplot(Bank_Personal_DF[col])
Age and Experience are roughly uniformly distributed and have very similar distributions.
Income, CCAvg, and Mortgage are positively skewed.
ZIP Code is negatively skewed, i.e. its values mostly come from a single region.
Mortgage is mostly zeros.
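The visual impressions above can be quantified with `pandas` skewness. A small sketch on synthetic columns shaped like the ones in the plots (illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic columns mimicking the shapes seen in the distplots
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.uniform(23, 67, 1000),        # roughly uniform: skew near 0
    "Income": rng.lognormal(4.0, 0.5, 1000), # right-skewed: skew well above 0
})

# .skew() puts a number on what the plots show visually
print(df.skew().round(2))
```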
fig=plt.figure(figsize=(15,10))
for i,col in enumerate(category_variables):
    ax=fig.add_subplot(3,3,i+1)
    sns.countplot(Bank_Personal_DF[col])
Most customers don't have a Securities Account, a CD Account, or a CreditCard.
More customers use internet banking facilities than not.
More customers are undergraduates and have a family size of one.
Bank_Personal_DF.dtypes
Bank_Personal_DF['Family'] = Bank_Personal_DF['Family'].astype('category')
Bank_Personal_DF['Education'] = Bank_Personal_DF['Education'].astype('category')
Bank_Personal_DF['Personal Loan'] = Bank_Personal_DF['Personal Loan'].astype('category')
Bank_Personal_DF['Online'] = Bank_Personal_DF['Online'].astype('category')
Bank_Personal_DF['CreditCard'] = Bank_Personal_DF['CreditCard'].astype('category')
Bank_Personal_DF.dtypes
# Credit Card V/s Personal Loan
sns.countplot(x=Bank_Personal_DF['CreditCard'],hue= Bank_Personal_DF['Personal Loan']);
Bank_Personal_DF['CreditCard'].value_counts(normalize = True)
#CD Account V/s Personal Loan
sns.countplot(x=Bank_Personal_DF['CD Account'],hue= Bank_Personal_DF['Personal Loan']);
Bank_Personal_DF['CD Account'].value_counts(normalize = True)
sns.countplot(x=Bank_Personal_DF['Family'],hue= Bank_Personal_DF['Personal Loan']);
sns.countplot(x=Bank_Personal_DF['Income'],hue= Bank_Personal_DF['Personal Loan']);
sns.countplot(x=Bank_Personal_DF['Education'],hue= Bank_Personal_DF['Personal Loan']);
Professionals are more likely to buy personal loans from the bank than undergraduates.
Bank_Personal_DF['Mortgage'].unique()
Bank_Personal_DF['Mortgage'].hist();
Bank_Personal_DF['Income'].hist(bins=100);
Bank_Personal_DF['Income'][Bank_Personal_DF['Personal Loan']==1].hist(bins=10)
Bank_Personal_DF['Income'][Bank_Personal_DF['Personal Loan']==0].hist(bins=10)
Bank_Personal_DF['Education'].value_counts()
sns.pairplot(Bank_Personal_DF,hue='Personal Loan');
plt.figure(1)
plt.subplot(131)
Bank_Personal_DF['Education'].value_counts(True).plot.bar(figsize=(24,6),title = 'Education')
plt.subplot(132)
Bank_Personal_DF['Family'].value_counts(True).plot.bar(figsize=(24,6),title = 'Family');
sns.boxplot(x="Mortgage", data=Bank_Personal_DF)
Mortgage has many outliers, as the boxplot shows.
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(continue_variables):
    ax=fig.add_subplot(3,3,i+1)
    ax1=sns.distplot(Bank_Personal_DF[col][Bank_Personal_DF['Personal Loan']==0],hist=True,label='No Personal Loan')
    sns.distplot(Bank_Personal_DF[col][Bank_Personal_DF['Personal Loan']==1],hist=True,ax=ax1,label='Personal Loan')
sns.catplot(x='Family', y='Income', hue='Personal Loan', data = Bank_Personal_DF, kind='swarm')
Customers with a family size of 3 or more and a higher income (between 100K and 200K) are more likely to take a loan.
plt.figure(figsize = (15,7))
plt.title('Correlation of Data Columns', y=1.05, size=19)
sns.heatmap(Bank_Personal_DF.corr(), annot=True, fmt='.2f',cmap="YlGnBu")
Age and Experience are highly correlated; the correlation is almost 1.
'Income' and 'CCAvg' are moderately correlated.
'Personal Loan' correlates most strongly with 'Income', 'CCAvg', 'CD Account', 'Mortgage', and 'Education'.
The heat map also shows associations of 'CD Account' with 'CreditCard', 'Securities Account', 'Online', 'CCAvg', and 'Income'.
'Income' influences 'CCAvg', 'Personal Loan', 'CD Account', and 'Mortgage'.
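Instead of reading the heat map by eye, the features can be ranked by their absolute correlation with the target. A minimal sketch on toy data with a hypothetical Income-driven loan relationship (column names mirror the dataset, values are synthetic):

```python
import numpy as np
import pandas as pd

# Toy data: loan acceptance loosely driven by income (hypothetical relationship)
rng = np.random.default_rng(1)
income = rng.normal(70, 30, 500)
loan = (income + rng.normal(0, 30, 500) > 100).astype(int)
df = pd.DataFrame({
    "Income": income,
    "CCAvg": 0.02 * income + rng.normal(0, 1, 500),
    "Personal Loan": loan,
})

# Rank features by absolute correlation with the target
ranked = df.corr()["Personal Loan"].drop("Personal Loan").abs().sort_values(ascending=False)
print(ranked)
```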
sns.countplot(Bank_Personal_DF['Personal Loan'])
## Contingency table to get Approved %
def get_contingency_table(df,target,var):
    ct_res = pd.crosstab(df[var],df[target],margins=True)
    ct_res['Approved (%)']=round(ct_res[1]/ct_res['All']*100,2)
    return ct_res.drop(columns=['All'])
get_contingency_table(Bank_Personal_DF,'Personal Loan','Education')
get_contingency_table(Bank_Personal_DF,'Personal Loan','Family')
Family size does not significantly affect the probability of accepting a loan.
get_contingency_table(Bank_Personal_DF,'Personal Loan','CreditCard')
Having a bank-issued credit card doesn't seem to affect the probability of buying a personal loan.
get_contingency_table(Bank_Personal_DF,'Personal Loan','Online')
Whether a customer uses internet banking facilities doesn't seem to affect the probability of buying a personal loan.
get_contingency_table(Bank_Personal_DF,'Personal Loan','Securities Account')
get_contingency_table(Bank_Personal_DF,'Personal Loan','CD Account')
Customers with a CD account are much more likely to buy a personal loan.
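Whether such a difference is statistically meaningful can be checked with a chi-square test of independence. A sketch using hypothetical counts that mirror the pattern in the contingency table (the exact numbers here are made up):

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = CD Account (no/yes), columns = Personal Loan (0/1);
# CD-account holders accept the loan at a much higher rate
table = [[4200, 340],
         [300, 160]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2g}")
# A very small p-value indicates 'CD Account' and 'Personal Loan' are associated
```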
get_contingency_table(Bank_Personal_DF,'Personal Loan','CCAvg')
Bank_Personal_DF.groupby('Personal Loan')['CCAvg'].mean().plot(kind='bar')
Bank_Personal_DF.groupby('Personal Loan')['Income'].mean().plot(kind='bar')
n_true = len(Bank_Personal_DF.loc[Bank_Personal_DF['Personal Loan'] == 1])
n_false = len(Bank_Personal_DF.loc[Bank_Personal_DF['Personal Loan'] == 0])
print("Number of true cases: {0} ({1:2.2f}%)".format(n_true, (n_true / (n_true + n_false)) * 100 ))
print("Number of false cases: {0} ({1:2.2f}%)".format(n_false, (n_false / (n_true + n_false)) * 100))
def get_stratified_ct(df,stra_var):
    ct_res = pd.crosstab(index=[df[stra_var],df['CD Account']],columns= df['Personal Loan'],margins=True)
    ct_res['Approved (%)']=round(ct_res[1]/ct_res['All']*100,2)
    return ct_res.drop(columns='All')
get_stratified_ct(Bank_Personal_DF,'Family')
get_stratified_ct(Bank_Personal_DF,'CCAvg')
print(Bank_Personal_DF['Mortgage'].mean())
Bank_Personal_DF['Mortgage'].std()
Bank_Personal_DF.boxplot(column=['Mortgage'], return_type='axes');
#Mortgage has many outliers - remove rows with an absolute z-score of 3 or more
Bank_Personal_DF['Mortgage_Score']=np.abs(stats.zscore(Bank_Personal_DF['Mortgage']))
Bank_Personal_DF=Bank_Personal_DF[Bank_Personal_DF['Mortgage_Score']<3]
Bank_Personal_DF.drop('Mortgage_Score',axis=1,inplace=True)
Bank_Personal_DF.shape
Bank_Personal_DF[['Age','Experience','Personal Loan']].corr()
Compared to Experience, Age shows a slightly better correlation with Personal Loan.
#We can drop the 'ZIP Code' and 'Experience' columns: 'ZIP Code' values are just
#a series of numbers, and 'Experience' is highly correlated with 'Age' as per the profile report.
#ZIP Code is numeric but not continuous; it behaves like a category,
#because a ZIP code identifies a region and regions are categorical.
#If we kept ZIP Code as a numeric value, higher codes would implicitly receive more weight,
#but no one should get preference based on their region.
#There are also too many distinct ZIP Code categories, so we drop the column.
Bank_Personal_DF.drop('Experience', axis=1, inplace=True)
Bank_Personal_DF.drop('ZIP Code', axis=1, inplace=True)
Bank_Personal_DF.head()
Bank_Personal_DF.drop_duplicates(inplace=True)
Bank_Personal_DF.shape
Bank_Personal_DF.head(10)
X = Bank_Personal_DF.drop('Personal Loan', axis=1)
y = Bank_Personal_DF['Personal Loan']
#Convert categorical variables to dummy variables
X = pd.get_dummies(X, drop_first=True)
X.head()
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.30,random_state=1)
print(X_train.shape)
print(X_test.shape)
print('original data, Personal Loan=1: {0} ({1:0.5f}%)'.format(y[y==1].shape[0], 100*y[y==1].shape[0]/y.shape[0]))
print('original data, Personal Loan=0: {0} ({1:0.5f}%)'.format(y[y==0].shape[0], 100*y[y==0].shape[0]/y.shape[0]))
print('-----------------------')
print('training data, Personal Loan=1: {0} ({1:0.5f}%)'.format(y_train[y_train==1].shape[0], 100*y_train[y_train==1].shape[0]/y_train.shape[0]))
print('training data, Personal Loan=0: {0} ({1:0.5f}%)'.format(y_train[y_train==0].shape[0], 100*y_train[y_train==0].shape[0]/y_train.shape[0]))
print('-----------------------')
print('testing data, Personal Loan=1: {0} ({1:0.5f}%)'.format(y_test[y_test==1].shape[0], 100*y_test[y_test==1].shape[0]/y_test.shape[0]))
print('testing data, Personal Loan=0: {0} ({1:0.5f}%)'.format(y_test[y_test==0].shape[0], 100*y_test[y_test==0].shape[0]/y_test.shape[0]))
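With an imbalanced target like this, `stratify=y` guarantees the class ratio is preserved in both splits rather than being approximately preserved by chance. A small self-contained sketch on a toy imbalanced target (roughly 10% positives, like the loan data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 10% positives
y = np.array([1] * 50 + [0] * 450)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```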
#Build the logistic regression model
import statsmodels.api as sm
logit = sm.Logit(y_train, sm.add_constant(X_train))
lg = logit.fit()
#Summary of logistic regression
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
print(lg.summary2())
A pseudo R^2 of 0.63 indicates that the full model explains 63% of the uncertainty left by the intercept-only model.
#Calculate Odds Ratio, probability
##create a data frame to collate Odds ratio, probability and p-value of the coef
lgcoef = pd.DataFrame(lg.params, columns=['coef'])
lgcoef.loc[:, "Odds_ratio"] = np.exp(lgcoef.coef)
lgcoef['probability'] = lgcoef['Odds_ratio']/(1+lgcoef['Odds_ratio'])
lgcoef['pval']=lg.pvalues
pd.options.display.float_format = '{:.2f}'.format
# Filter by significant p-value (pval <= 0.05) and sort descending by odds ratio
lgcoef = lgcoef.sort_values(by="Odds_ratio", ascending=False)
pval_filter = lgcoef['pval']<=0.05
lgcoef[pval_filter]
X_train.isnull().sum()
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)  # fit on the training data only, to avoid leaking test-set statistics
X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)  # reuse the training-set mean and scale on the test set
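A common way to make this fit-on-train-only discipline automatic is a scikit-learn `Pipeline`, which fits the scaler and the model together. A minimal sketch on synthetic features with wildly different scales (the data and names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# The pipeline fits the scaler on the training portion only,
# so the test set never leaks into preprocessing
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear', random_state=1))
pipe.fit(X[:150], y[:150])
acc = pipe.score(X[150:], y[150:])
print(round(acc, 3))
```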
logreg = LogisticRegression(solver='liblinear', random_state=1)
logreg.fit(X_train, y_train)
# train data
pred_train = logreg.predict(X_train)
cm_train = confusion_matrix(y_train, pred_train)
print('Train Data-confusion_matrix = \n',cm_train)
# test data
pred_test = logreg.predict(X_test)
cm_test = confusion_matrix(y_test, pred_test)
print('Test Data-confusion_matrix = \n',cm_test)
y_pred=logreg.predict(X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
lg_prob=logreg.predict_proba(X_test)
lg_prob
fp,tp,th=roc_curve(y_test,lg_prob[:,1])
roc=auc(fp,tp)
print(roc)
## function to get confusion matrix in a proper format
def draw_matrix(actual, predicted):
    cm = confusion_matrix(actual, predicted, labels=[1,0])
    sns.heatmap(cm, annot=True, fmt='.2f',cmap="YlGnBu", xticklabels = [1,0] , yticklabels = [1,0] )
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()
print("Training accuracy",logreg.score(X_train,y_train))
print()
print("Testing accuracy",logreg.score(X_test, y_test))
print()
print('Confusion Matrix')
draw_matrix(y_test,y_pred)
print()
print("Recall:",recall_score(y_test,y_pred))
print()
print("Precision:",precision_score(y_test,y_pred))
print()
print("F1 Score:",f1_score(y_test,y_pred))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_pred))
#AUC ROC curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
threshold = 0.5
preds = np.where(logreg.predict_proba(X_test)[:,1]>threshold,1,0)
draw_matrix(y_test,preds)
print()
print("Recall:",recall_score(y_test,preds))
print()
print("Precision:",precision_score(y_test,preds))
print()
print("F1 Score:",f1_score(y_test,preds))
print()
print("Roc Auc Score:",roc_auc_score(y_test,preds))
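The 0.5 cutoff used above is only one choice; the threshold itself can be tuned on held-out data. A sketch on synthetic imbalanced data (names and values are illustrative) that sweeps candidate thresholds and keeps the one with the best F1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Imbalanced toy problem
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1).astype(int)

clf = LogisticRegression(solver='liblinear', random_state=1).fit(X[:300], y[:300])
probs = clf.predict_proba(X[300:])[:, 1]

# Sweep thresholds and keep the one with the best F1 on held-out data
thresholds = np.arange(0.1, 0.9, 0.05)
scores = [f1_score(y[300:], (probs > t).astype(int), zero_division=0) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"best threshold={best:.2f}, F1={max(scores):.3f}")
```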
# Checking Parameters of logistic regression
logreg.get_params()
#If we don't specify parameters, the model uses its default values
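Rather than looping over solvers, penalties, and C values by hand as below, `GridSearchCV` can search the parameter combinations with cross-validation. A self-contained sketch on synthetic imbalanced data (the data and grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic data standing in for the loan features
X, y = make_classification(n_samples=300, n_features=5, weights=[0.85], random_state=1)

# Search C and penalty jointly, scoring by F1 with 5-fold cross-validation
grid = GridSearchCV(
    LogisticRegression(solver='liblinear', max_iter=1000, random_state=1),
    param_grid={'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']},
    scoring='f1',
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```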
## print a list of floating numbers
def printflist(x):
    for i in x:
        print('{:.3f}'.format(i), end=' ')
    print()
train_scores=[]
test_scores=[]
recall_scores=[]
precision_scores=[]
f1_scores=[]
roc_auc_scores=[]
# The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only
# L2 regularization with primal formulation, or no regularization.
# The ‘liblinear’ solver supports both L1 and L2 regularization,
# with a dual formulation only for the L2 penalty.
# The Elastic-Net regularization is only supported by the ‘saga’ solver.
solvers = ['lbfgs', 'sag', 'saga', 'newton-cg', 'liblinear']
for i in solvers:
    md = LogisticRegression(random_state=1, solver=i, max_iter=1000, penalty='l2')
    md.fit(X_train, y_train)
    y_predict = md.predict(X_test)
    train_scores.append(md.score(X_train, y_train))
    test_scores.append(md.score(X_test, y_test))
    recall_scores.append(recall_score(y_test, y_predict))
    precision_scores.append(precision_score(y_test, y_predict))
    f1_scores.append(f1_score(y_test, y_predict))
    roc_auc_scores.append(roc_auc_score(y_test, y_predict))
print(solvers)
printflist(train_scores)
printflist(test_scores)
printflist(recall_scores)
printflist(precision_scores)
printflist(f1_scores)
printflist(roc_auc_scores)
train_scores=[]
test_scores=[]
recall_scores=[]
precision_scores=[]
f1_scores=[]
roc_auc_scores=[]
solvers = ['saga','liblinear'] # changing values of solver which works with 'l1'
for i in solvers:
    md = LogisticRegression(random_state=1, solver=i, max_iter=1000, penalty='l1')
    md.fit(X_train, y_train)
    y_predict = md.predict(X_test)
    train_scores.append(md.score(X_train, y_train))
    test_scores.append(md.score(X_test, y_test))
    recall_scores.append(recall_score(y_test, y_predict))
    precision_scores.append(precision_score(y_test, y_predict))
    f1_scores.append(f1_score(y_test, y_predict))
    roc_auc_scores.append(roc_auc_score(y_test, y_predict))
print(solvers)
printflist(train_scores)
printflist(test_scores)
printflist(recall_scores)
printflist(precision_scores)
printflist(f1_scores)
printflist(roc_auc_scores)
train_scores=[]
test_scores=[]
recall_scores=[]
precision_scores=[]
f1_scores=[]
roc_auc_scores=[]
C = [0.01, 0.1, 0.25, 0.5, 0.75, 1] ## testing different C values
for c in C:
    md = LogisticRegression(solver='liblinear', penalty='l1', max_iter=1000, random_state=1, C=c)
    md.fit(X_train, y_train)
    y_predict = md.predict(X_test)
    train_scores.append(md.score(X_train, y_train))
    test_scores.append(md.score(X_test, y_test))
    recall_scores.append(recall_score(y_test, y_predict))
    precision_scores.append(precision_score(y_test, y_predict))
    f1_scores.append(f1_score(y_test, y_predict))
    roc_auc_scores.append(roc_auc_score(y_test, y_predict))
printflist(C)
printflist(train_scores)
printflist(test_scores)
printflist(recall_scores)
printflist(precision_scores)
printflist(f1_scores)
printflist(roc_auc_scores)
model = LogisticRegression(solver='liblinear', penalty='l1', max_iter=1000, random_state=1)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("Training accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print('Confusion Matrix')
draw_matrix(y_test,y_predict)
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
## change class_weight to 'balanced'
model = LogisticRegression(solver='liblinear', penalty='l1', max_iter=1000, random_state=1, class_weight='balanced')
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("Training accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print('Confusion Matrix')
draw_matrix(y_test,y_predict)
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
# C=0.1 gives good results with a high precision value
model = LogisticRegression(solver='liblinear', penalty='l1', max_iter=1000, random_state=1, C=0.1)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("Training accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print('Confusion Matrix')
draw_matrix(y_test,y_predict)
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
!pip install yellowbrick
from yellowbrick.classifier import ClassificationReport, ROCAUC
# Visualize model performance with yellowbrick library
viz = ClassificationReport(logreg)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
roc = ROCAUC(logreg)
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.show();
Experience is highly correlated with Age.
ID contains only unique values.
CCAvg has 106 (2.1%) zeros
Mortgage has 3462 (69.2%) zeros
93% of customers don't have a certificate of deposit (CD) account with the bank.
70% of customers don't use a credit card issued by the bank.
59% of customers use internet banking facilities.
90% of customers did not accept the personal loan offered in the last campaign.
89% of customers don't have a securities account with the bank.
42% of customers are graduates, while 30% are advanced/professional and 28% are undergraduates.
29% of customers have a family size of 1.
True Positive: the model predicted that the customer would buy a personal loan, and the customer actually did.
False Positive: the model predicted that the customer would buy a personal loan, but the customer did not.
True Negative: the model predicted that the customer would not buy a personal loan, and the customer did not.
False Negative: the model predicted that the customer would not buy a personal loan, but the customer actually did.
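These four quantities can be read straight out of the confusion matrix. A tiny sketch with hypothetical predictions (the labels here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predictions illustrating the four outcomes
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])

# With the default label order [0, 1], ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=2 FP=1 FN=2 TN=3
```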
We need the model to identify true-positive customers, i.e. the right customers to target.
The model flags a set of likely buyers (Predicted Positive = TP + FP); the more of those flagged customers who actually buy a personal loan (true positives), the higher the campaign's success ratio.